Goto

Collaborating Authors

 student classifier


Your Image Generator Is Your New Private Dataset

Resmini, Nicolo, Lomurno, Eugenio, Sbrolli, Cristian, Matteucci, Matteo

arXiv.org Artificial Intelligence

Generative diffusion models have emerged as powerful tools to synthetically produce training data, offering potential solutions to data scarcity and reducing labelling costs for downstream supervised deep learning applications. However, effectively leveraging text-conditioned image generation for building classifier training sets requires addressing key issues: constructing informative textual prompts, adapting generative models to specific domains, and ensuring robust performance. This paper proposes the Text-Conditioned Knowledge Recycling (TCKR) pipeline to tackle these challenges. TCKR combines dynamic image captioning, parameter-efficient diffusion model fine-tuning, and Generative Knowledge Distillation techniques to create synthetic datasets tailored for image classification. The pipeline is rigorously evaluated on ten diverse image classification benchmarks. The results demonstrate that models trained solely on TCKR-generated data achieve classification accuracies on par with (and in several cases exceeding) models trained on real images. Furthermore, the evaluation reveals that these synthetic-data-trained models exhibit substantially enhanced privacy characteristics: their vulnerability to Membership Inference Attacks is significantly reduced, with the membership inference AUC lowered by 5.49 points on average compared to using real training data, demonstrating a substantial improvement in the performance-privacy trade-off. These findings indicate that high-fidelity synthetic data can effectively replace real data for training classifiers, yielding strong performance whilst simultaneously providing improved privacy protection as a valuable emergent property. The code and trained models are available in the accompanying open-source repository.


Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks

Lomurno, Eugenio, Matteucci, Matteo

arXiv.org Artificial Intelligence

Generative artificial intelligence has transformed the generation of synthetic data, providing innovative solutions to challenges like data scarcity and privacy, which are particularly critical in fields such as medicine. However, the effective use of this synthetic data to train high-performance models remains a significant challenge. This paper addresses this issue by introducing Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers. At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information provided to classifiers through a synthetic dataset regeneration and soft labelling mechanism. The KR pipeline has been tested on a variety of datasets, with a focus on six highly heterogeneous medical image datasets, ranging from retinal images to organ scans. The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases. Furthermore, the resulting models show almost complete immunity to Membership Inference Attacks, manifesting privacy properties missing in models trained with conventional techniques.


Learning from Higher-Layer Feature Visualizations

Nikolaidis, Konstantinos, Kristiansen, Stein, Goebel, Vera, Plagemann, Thomas

arXiv.org Machine Learning

Driven by the goal to enable sleep apnea monitoring and machine learning-based detection at home with small mobile devices, we investigate whether interpretation-based indirect knowledge transfer can be used to create classifiers with acceptable performance. Interpretation-based indirect knowledge transfer means that a classifier (student) learns from a synthetic dataset based on the knowledge representation from an already trained Deep Network (teacher). We use activation maximization to generate visualizations and create a synthetic dataset to train the student classifier. This approach has the advantage that student classifiers can be trained without access to the original training data. With experiments we investigate the feasibility of interpretation-based indirect knowledge transfer and its limitations. The student achieves an accuracy of 97.8% on MNIST (teacher accuracy: 99.3%) with a similar smaller architecture to that of the teacher. The student classifier achieves an accuracy of 86.1% and 89.5% for a subset of the Apnea-ECG dataset (teacher: 89.5% and 91.1%, respectively).


Knowledge Distillation with Adversarial Samples Supporting Decision Boundary

Heo, Byeongho, Lee, Minsik, Yun, Sangdoo, Choi, Jin Young

arXiv.org Machine Learning

Many recent works on knowledge distillation have provided ways to transfer the knowledge of a trained network for improving the learning process of a new one, but finding a good technique for knowledge distillation is still an open problem. In this paper, we provide a new perspective based on a decision boundary, which is one of the most important component of a classifier. The generalization performance of a classifier is closely related to the adequacy of its decision boundary, so a good classifier bears a good decision boundary. Therefore, transferring information closely related to the decision boundary can be a good attempt for knowledge distillation. To realize this goal, we utilize an adversarial attack to discover samples supporting a decision boundary. Based on this idea, to transfer more accurate information about the decision boundary, the proposed algorithm trains a student classifier based on the adversarial samples supporting the decision boundary. Experiments show that the proposed method indeed improves knowledge distillation and achieves the state-of-the-arts performance.